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Abstract 

Scoring  rules  assess  the  quality  of  probabilistic  forecasts,  by  assigning  a  numerical 
score  based  on  the  forecast  and  on  the  event  or  value  that  materializes.  A  scoring  rule  is 
strictly  proper  if  the  forecaster  maximizes  the  expected  score  for  an  observation  drawn 
from  the  distribution  F  if  she  issues  the  probabilistic  forecast  F,  rather  than  any  G  ^  F. 
In  prediction  problems,  strictly  proper  scoring  rules  encourage  the  forecaster  to  make 
careful  assessments  and  to  be  honest.  In  estimation  problems,  strictly  proper  scoring 
rules  provide  attractive  loss  and  utility  functions  that  can  be  tailored  to  the  scientific 
problem  at  hand. 

This  paper  characterizes  strictly  proper  scoring  rules  on  general  probability  spaces, 
and  proposes  and  discusses  examples  of  such.  In  the  case  of  categorical  and  binary  vari¬ 
ables,  a  rigorous  version  of  the  Savage  representation  is  established.  Examples  of  scoring 
rules  for  probabilistic  forecasts  in  the  form  of  predictive  densities  include  the  spherical, 
pseudospherical,  logarithmic  and  quadratic  score.  The  continuous  ranked  probability 
score  applies  to  probabilistic  forecasts  that  take  the  form  of  predictive  cumulative  dis¬ 
tribution  functions;  it  generalizes  the  absolute  error  and  forms  a  special  case  of  a  new 
and  very  general  type  of  score,  the  energy  score.  Proper  scoring  rules  for  quantile  and 
interval  forecasts  are  also  discussed.  We  relate  proper  scoring  rules  to  Bayes  factors  and 
to  cross-validation,  and  show  that  a  particular  form  of  cross-validation,  random-fold 
cross- validated  likelihood,  corresponds  to  a  proper  scoring  rule.  This  also  allows  us  to 
define  proper  scoring  rules  when  parameters  defining  the  rule  are  estimated  from  the 
data. 

A  case  study  on  probabilistic  weather  forecasts  in  the  North  American  Pacific  North¬ 
west  illustrates  the  importance  of  strict  propriety.  Optimum  score  approaches  to  point 
estimation  are  noted,  and  the  intuitively  appealing  interval  score  is  proposed  as  a  utility 
function  in  interval  estimation  that  addresses  width  as  well  as  coverage. 
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1  Introduction 


One  of  the  major  purposes  of  statistical  analysis  is  to  make  forecasts  for  the  future,  and  to 
provide  suitable  measures  of  the  uncertainty  associated  with  them.  Consequently,  forecasts 
should  be  probabilistic  in  nature,  taking  the  form  of  probability  distributions  over  future 
events  (Dawid  1984).  Indeed,  over  the  past  two  decades  probabilistic  forecasting  has  become 
routine  in  applications  such  as  weather  prediction  (Palmer  2002;  Gel,  Raftery  and  Gneiting 
2004)  and  macroeconomic  forecasting  (Garratt,  Lee,  Pesaran  and  Shin  2003).  Gneiting, 
Raftery,  Balabdaoui  and  Westveld  (2003)  contend  that  the  goal  of  probabilistic  forecasting 
is  to  maximize  sharpness  subjeet  to  ealibration.  Calibration  refers  to  statistical  consistency 
between  the  distributional  forecasts  and  the  observations  and  is  a  joint  property  of  the 
forecasts  and  the  events  that  materialize.  Sharpness  refers  to  the  concentration  of  the 
predictive  distributions  and  is  a  property  of  the  forecasts  only. 

Searing  rules  provide  summary  measures  for  the  evaluation  of  probabilistic  forecasts,  by 
assigning  a  numerical  score  based  on  the  forecast  and  on  the  event  or  value  that  materializes. 
In  terms  of  elicitation,  the  role  of  scoring  rules  is  to  encourage  the  assessor  to  make  careful 
assessments  and  to  be  honest.  In  terms  of  evaluation,  scoring  rules  measure  the  quality  of 
the  probabilistic  forecasts  and  reward  assessors  for  forecasting  jobs.  In  a  Bayesian  context, 
scores  are  frequently  referred  to  as  utilities,  thereby  emphasizing  the  Bayesian  principle  of 
maximizing  the  expected  utility  of  a  predictive  distribution  (Bernardo  and  Smith  1994).  We 
take  scoring  rules  to  be  positively  oriented  rewards  that  a  forecaster  wishes  to  maximize. 
Specifically,  if  the  forecaster  quotes  the  predictive  distribution  P  and  the  event  x  materi¬ 
alizes,  her  reward  is  S{P,x).  The  function  S{P,-)  takes  values  in  the  extended  real  line 
M  =  [—00,00],  and  we  write  S{P,Q)  for  the  expected  value  of  S{P,-)  under  Q.  Suppose, 
then,  that  the  forecaster’s  best  judgement  is  the  distributional  forecast  Q.  The  forecaster 
has  no  incentive  to  predict  any  P  ^  Q,  and  is  encouraged  to  quote  her  true  belief,  P  =  Q,\i 
S{Q,  Q)  >  S{P,  Q)  with  equality  if  and  only  ii  P  =  Q.  A  scoring  rule  with  this  property  is 
said  to  be  strietly  proper.  If  S{Q,  Q)  >  S{P,  Q)  for  all  P  and  Q  the  scoring  rule  is  said  to  be 
proper.  Propriety  is  essential  in  scientific  and  operational  forecast  evaluation,  and  the  case 
study  in  Section  7  below  provides  a  striking  example  of  some  of  the  difficulties  resulting 
from  the  use  of  intuitively  appealing  but  improper  scoring  rules. 

In  estimation  problems,  strictly  proper  scoring  rules  provide  attractive  loss  and  utility 
functions  that  can  be  tailored  to  a  scientific  problem.  To  fix  the  idea,  suppose  that  we 
wish  to  fit  a  parametric  model  Pq  based  on  a  sample  Xi, . . . ,  A„.  To  estimate  9,  we  might 
measure  the  goodness-of-fit  by  the  mean  score 
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Sn{e)  =  -J2S{Pe,X,), 

where  S'  is  a  strictly  proper  scoring  rule.  If  Oq  denotes  the  true  parameter,  asymptotic 
arguments  indicate  that  argmax^i  Sn{9)  9q  as  n  ^  co.  This  suggests  a  general  approach 
to  estimation:  choose  a  strictly  proper  scoring  rule  that  is  tailored  to  the  problem  at 
hand,  maximize  Sn{9)  over  the  parameter  space,  and  take  6n  =  argmax0  5n(0)  as  the 
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optimum  score  estimator  based  on  the  scoring  rule  S.  Pfanzagl  (1969)  and  Birge  and 
Massart  (1993)  studied  this  approach  under  the  heading  of  minimum  contrast  estimation. 
Maximum  likelihood  estimation  forms  a  special  case  of  optimum  score  estimation,  and 
optimum  score  estimation  forms  a  special  case  of  M-estimation  (Huber  1964)  in  that  the 
function  to  be  optimized  derives  from  a  strictly  proper  scoring  rule.  The  appeal  of  optimum 
score  estimation  lies  in  the  potential  adaptation  of  the  scoring  rule  to  the  problem  at  hand. 
Apparently,  this  approach  has  only  very  recently  been  explored  (Buja,  Stuetzle  and  Shen 
2004;  Gneiting,  Westveld,  Raftery  and  Goldman  2004). 

The  remainder  of  the  paper  is  organized  as  follows.  In  Section  2  we  prove  a  fundamental 
characterization  theorem  for  strictly  proper  scoring  rules  on  general  probability  spaces. 
Section  3  turns  to  scoring  rules  for  categorical  variables.  The  landmark  paper  of  Savage 
(1971,  p.  793)  gave  an  elegant  characterization  of  proper  scoring  rules  that  Savage  described 
as  “figurative.”  Theorem  3.2  provides  a  rigorous  version  thereof,  and  Theorem  3.4  relates 
to  a  more  recent  representation  of  Schervish  (1989).  Bremnes  (2004,  p.  346)  noted  that  the 
literature  on  scoring  rules  for  probabilistic  forecasts  of  continuous  variables  is  sparse.  We 
address  this  issue  in  Section  4  and  discuss  the  spherical,  pseudospherical,  logarithmic  and 
quadratic  score.  The  continuous  ranked  probability  score  has  lately  attracted  the  attention  of 
meteorologists  and  forms  a  special  case  of  a  novel  and  very  general  class  of  scoring  rules,  the 
energy  score.  Section  5  studies  scoring  rules  for  quantile  and  interval  forecasts.  We  show  the 
class  of  proper  scoring  rules  for  quantile  forecasts  to  be  larger  than  conjectured  by  Gervera 
and  Munoz  (1996)  and  introduce  the  interval  score,  a  scoring  rule  for  central  prediction 
intervals  that  is  proper  and  has  intuitive  appeal.  In  Section  6  we  relate  proper  scoring  rules 
to  Bayes  factors  and  to  cross-validation,  and  show  that  a  particular  form  of  cross-validation, 
random-fold  cross-validated  likelihood,  corresponds  to  a  proper  scoring  rule.  This  provides 
one  way  of  defining  proper  scoring  rules  when  parameters  are  being  estimated.  Section 
7  presents  the  aforementioned  case  study  that  concerns  the  use  of  scoring  rules  in  the 
assessment  of  probabilistic  weather  forecasts.  Section  8  turns  to  optimum  score  estimation 
and  closes  the  paper.  We  discuss  both  point  estimation  and  interval  estimation  and  propose 
the  use  of  the  interval  score  as  a  utility  function  that  addresses  width  as  well  as  coverage. 

2  Characterization  of  strictly  proper  scoring  rules 

This  section  introduces  notation  and  characterizes  strictly  proper  scoring  rules.  The  discus¬ 
sion  is  technical  and  readers  with  more  applied  interests  might  skip  ahead  without  significant 
loss  of  continuity.  We  consider  probabilistic  forecasts  on  a  general  sample  space  H.  Let  A  be 
a  fj- algebra  of  subsets  of  H,  and  let  "P  be  a  convex  class  of  probability  measures  on  (H,  ^).  A 
function  on  H  is  V-quasiintegrable  if  it  is  measurable  with  respect  to  A  and  quasiintegrable 
with  respect  to  all  P  G  V  (Bauer  2001,  p.  64).  A  probabilistic  forecast  is  any  probability 
measure  P  G  V.  A  scoring  rule  is  any  extended  real-valued  function  S  :  V  x  kl  ^  R  such 
that  S{P,  ■)  is  P-quasiintegrable  for  all  P  G  V.  Hence,  if  the  forecast  is  P  and  tv  materializes. 
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the  forecaster’s  reward  is  S{P,(jj).  We  define 

S{P,Q)  =  I  S{P,uj)  dQ{uj) 

as  the  expected  score  under  Q  when  the  probabilistic  forecast  is  P.  The  scoring  rule  S  is 
proper  relative  to  V  if 

S{Q,  Q)  >  S{P,  Q)  for  all  P,QeV.  (1) 

It  is  strictly  proper  relative  to  V  if  (1)  holds  with  equality  if  and  only  if  P  =  Q,  thereby 
encouraging  honest  quotes  by  the  forecaster. 

A  function  G  :  V  is  convex  if 

G((1-A)Po  +  APi)  <  (1-A)G(Po)  +  AG(Pi)  for  all  A  G  (0, 1),  Pq,  A  G  7^.  (2) 

It  is  strictly  convex  if  (2)  holds  with  equality  if  and  only  if  Pq  =  Pi-  A  function  G*{P,  ■ )  : 
n  ^  M  is  a  subtangent  of  G  at  the  point  P  G  P  if  it  is  P-quasiintegrable  and 

G{Q)  >  G{P)  +  I  G*{P,co)  dQ{co)  -  j  G*{P,io)  dP{co)  (3) 

for  all  Q  G  V.  The  following  theorem  generalizes  previous  results  by  McCarthy  (1956)  and 
Hendrickson  and  Buehler  (1971). 

Theorem  2.1  The  scoring  rule  S  :  P  x  H  — >  M  is  (strictly)  proper  if  and  only  if  there  exists 
a  (strictly)  convex  function  G  :  P  ^  R  such  that  G{P)  =  S{P,  P)  for  P  G  P  and 

S{P,co)  =  G*{P,co)  (4) 

for  P  G  P  and  u;  G  where  G*{P,  ■)  :  Q  ^  R  is  a  subtangent  of  G  at  the  point  P  G  P. 

Proof.  If  S{P,uj)  =  G*{P,(jj)  is  of  the  stated  form  the  subtangent  inequality  (3)  implies 
(1),  that  is,  propriety.  Conversely,  suppose  that  S  is  a  proper  scoring  rule.  Define  G  :  P  ^  R 
by  G{P)  =  5(P,  P).  Then  the  subtangent  inequality  (3)  holds  with  G*{P,(jj)  =  5(P,  w), 
which  is  a  P-quasiintegrable  function.  Furthermore, 

G{P)  =  SUpggp  S{Q,P) 

is  the  pointwise  supremum  over  a  class  of  convex  functions  and  therefore  convex  on  P.  This 
proves  the  claim  for  propriety.  In  analogy  to  an  argument  of  Hendrickson  and  Buehler 
(1971),  strict  inequality  in  (1)  is  equivalent  to  no  subtangent  of  G  at  P  being  a  subtangent 
of  G  at  Q,  for  P,QgP  and  P  Q,  and  this  is  equivalent  to  G  being  strictly  convex  on  P. 


Expressed  slightly  differently,  the  scoring  rule  S  is  (strictly)  proper  if  and  only  if  the 
expected  score  function  G(P)  =  S{P,  P)  is  (strictly)  convex  on  P  and  S{P,  • )  is  a  subtangent 
of  G  at  the  point  P,  for  all  P  G  P.  A  comparison  of  Theorem  2.1  to  a  more  direct  extension 
of  the  McCarthy  (1956)  and  Hendrickson  and  Buehler  (1971)  characterization  is  given  in 
the  appendix. 
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3  Scoring  rules  for  categorical  variables 


We  now  discuss  the  representations  of  Savage  (1971)  and  Schervish  (1989)  that  characterize 
scoring  rules  for  probabilistic  forecasts  of  categorical  and  binary  variables,  respectively. 

3.1  Savage  representation 

We  consider  probabilistic  forecasts  of  a  categorical  variable.  Hence,  the  sample  space  H  = 
{1, . . .  ,m}  consists  of  a  finite  number  m  of  mutually  exclusive  events,  and  a  probabilistic 
forecast  is  a  probability  vector  (pi, . . .  ,Pm)-  Using  the  notation  of  Section  2,  we  take  V  =  Vm 
where 

'Pm  =  [p=  {pi,  ■  ■  ■  ,Pm-l)  :  Pi,  •  •  •  ,Pm-l  >  0,  pi  H - hPm-1  <  l} 

denotes  the  unit  simplex  in  A  scoring  rule  S  can  then  be  identified  with  a  collection 

of  m  functions 

S{-  ,i)  :Vm  1  =  l,...,m; 

that  is,  if  the  forecaster  quotes  the  probability  vector  p  and  the  event  i  materializes,  her 
reward  is  S{p,i).  Theorem  3.2  provides  a  rigorous  version  of  the  Savage  (1971)  representa¬ 
tion  of  (strictly)  proper  scoring  rules  on  finite  sample  spaces.  Our  contribution  lies  in  the 
rigorous  treatment  and  in  the  introduction  of  appropriate  tools  of  convex  analysis.  We  recall 
from  Rockafellar  (1970,  Sections  23-25)  that  any  nonconstant  convex  function  G  :  Vm  1^ 
that  is  bounded  above  is  bounded  below,  and  is  continuous  except  possibly  at  the  boundary 
of  Vm-  A  vector  G'{p)  =  {G'i{p), . . . ,  G'm_i(p))  G  is  said  to  be  a  subgradient  of  G  at 

the  point  p  G  Vm  if 

G{q)>G{p)  +  {G'{p),q-p)  (5) 

for  all  q  G  Vm-,  where  (•  ,  •)  denotes  a  scalar  product.  The  value  of  the  subgradient  G'{p) 
is  unique  and  equals  the  gradient  at  every  point  p  in  the  interior  of  Vm  at  which  G  is 
differentiable. 

Definition  3.1  A  scoring  rule  S  for  categorical  forecasts  is  regular  if  S{- ,i)  is  bounded 
above  and  real- valued  for  i  =  1, . . . ,  m,  except  possibly  that  if  i  <  m  —  1  and  pi  =  0  then 
S{p,  i)  =  — oo,  and  if  pi  Pm-i  =  1  then  S{p,  m)  =  — oo. 

Theorem  3.2  (Savage)  A  regular  searing  rule  S  for  eategorieal  foreeasts  is  (strietly)  proper 
if  and  only  if 


S{p,i)=G{p)-{G'{p),p)  +  G'i{p)  for  i  =  l,...,m-l,  (6) 

and 

S{p,m)  =  G{p)  -  {G\p),p),  (7) 

where  G  :  Vm  ^  M  is  a  bounded  (strietly)  eonvex  funetion  and  G'{p)  is  a  subgradient  of  G 
at  the  point  p,  for  all  p  G  Vm- 
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Proof.  If  the  scoring  rule  S  admits  the  representation  (6)  and  (7)  with  a  bounded  convex 
function  G,  the  subgradient  inequality  (5)  implies  propriety  and  S  is  regular.  Conversely, 
suppose  that  S'  is  a  regular  proper  scoring  rule.  By  Theorem  2.1,  S{p,  ■)  =  G*  {p,  ■)  where 
G*{p,  • )  is  a  subtangent  of  the  bounded  convex  function 


m— 1  /  m—1  \ 

G{p)  =  S{p,p)  =  PjS{p,j)  +  1  -  XI  S{p,m).  (8) 

1=1  V  1=1  / 

The  subtangent  inequality  (3)  implies  that  G{q)  >  G{p)  +  {G'{p),q  —  p)  for  all  p,q  ^  Vm, 
thereby  showing  that  G'(p)  =  S{p,i)  —  S{p,m)  is  the  ith  component  of  a  subgradient  of  G 
at  p,  for  i  =  1, . . . ,  m  —  1  and  for  all  p  £  Vm-  Summing  over  i  =  1, . . . ,  m  —  1,  we  obtain 
the  representation  (7)  and  then  (6),  and  this  proves  the  claim  for  propriety.  The  claim  for 
strict  propriety  is  immediate  from  Theorem  2.1.  ■ 

Corollary  3.3  A  regular  scoring  rule  S  for  categorical  forecasts  is  (strictly)  proper  if  and 
only  if  the  expected  score  function  G{p)  =  S{p,p)  is  (strictly)  convex  on  Vm,  md  the  vector 
with  components  S{p,  i)  —  S{p,  m)  for  i  =  l,...,m  —  1  is  a  subgradient  of  G  at  the  point  p, 
for  all  p  £  Vm- 

In  view  of  Theorem  3.2,  every  bounded  strictly  convex  function  G  :  Vm  IK  generates 
a  strictly  proper  scoring  rule.  We  put  pm  =  1  —  Pj  note  a  number  of  examples.  If 

G{p)  =  —  ~Pj)  then  (6)  and  (7)  yield  the  Brier  score  or  quadratic  score,  S{p,  i)  = 

‘2pi  —  l  —  J2^=iP‘j-  The  negative  of  the  entropy  function,  G{p)  =  Pj  Pj ,  corresponds 
to  the  logarithmic  score,  S{p,i)  =  logp*.  These  scores  are  classical  and  were  proposed  by 
Brier  (1950)  and  Good  (1952),  respectively.  Another  well-known  scoring  rule  is  the  spherical 
score  which  corresponds  to  the  expected  score  function  G{jp)  =  (pf  -!-•••  +  so  that 

S{p,  i)  =  Pi/{pi  +  -  •  '+plmY^‘^  ■  The  quadratic,  logarithmic  and  spherical  score  are  symmetric 
in  the  sense  that 

S{{pi,---,Pm-l),i)  =  S  ((Ptti,  •  •  •  ,P7r™_i),Vri) 

for  all  p  £  Vm,  for  all  permutations  vr  on  m  elements  and  for  all  events  i  =  1, . . . ,  m.  Win¬ 
kler  (1994)  argued  that  symmetric  rules  do  not  always  appropriately  reward  forecasting 
skill  and  called  for  asymmetric  ones.  Asymmetric  scoring  rules  can  be  generated  by  ap¬ 
plying  Theorem  3.2  to  strictly  convex  functions  G  that  are  not  invariant  under  coordinate 
permutation. 

3.2  Schervish  representation 

The  classical  case  m  =  2  of  binary  “yes”  or  “no”  forecasts  calls  for  further  discussion.  We 
follow  the  literature  in  considering  the  sample  space  H  =  {1, 0}.  A  probabilistic  forecast  is  a 
quoted  probability  p  £  [0, 1]  for  “yes,”  or  1,  and  a  scoring  rule  S  can  be  identified  with  a  pair 
of  functions  S{-  ,1)  :  [0, 1]  ^  M  and  S'(  • ,  0)  :  [0, 1]  ^  M.  Hence,  S{p,  1)  is  the  forecaster’s 
reward  is  she  quotes  p  and  the  event  materializes,  and  S{p,0)  is  the  compensation  if  she 
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quotes  p  and  the  event  does  not  materialize.  By  Theorem  3.2,  every  regular  (strictly)  proper 
scoring  rule  for  binary  variables  is  of  the  form 

S{p,  1)  =  G{p)  +  (1  -  p)  G\p),  S{p,  0)  =  G{p)  -  pG'ip)  (9) 

where  G  :  [0, 1]  ^  M  is  a  bounded  (strictly)  convex  function  and  G'{p)  is  a  subgradient  of 
G  at  the  point  p,  for  all  p  G  [0, 1].  If  G  is  differentiable  at  an  interior  point  p  G  (0, 1)  then 
G'{p)  is  simply  the  derivative  of  G  at  p.  Putting  G(p)  =  — 2p(l  —  p)  and  G(p)  =  plogp  + 
(1  —p)  log(l  —p)  in  (9)  recovers  the  Brier  score  and  the  logarithmic  score,  respectively.  The 
expected  score  function  G(p)  =  — p^/^(l  —  p)^/^  has  been  associated  with  boosting  (Buja  et 
al.  2004).  Any  strictly  proper  scoring  rule  for  binary  variables  is  regular.  Furthermore,  any 
proper  binary  scoring  rule  that  is  not  regular  satisfies  either  max{5(p,  1),  5(p,  0)}  =  +oo  for 
all  p  G  (0, 1),  or  min{5(p,  1),  S{p,  0)}  =  — oo  for  all  p  G  (0, 1),  and  therefore  is  uninteresting. 

The  Savage  representation  (9)  implies  various  interesting  properties  of  (strictly)  proper 
regular  scoring  rules.  For  instance.  Theorem  24.2  of  Rockafellar  (1970)  implies  that 

5(p,  1)  =  lirri  G{q)  -  [  {G'{q)  -  G'{p))  dq  (10) 

9^1  Jp 

for  p  G  (0,1),  and  since  G'(p)  is  (strictly)  increasing,  S'(p,  1)  is  (strictly)  increasing,  too. 
Similarly,  5(p,  0)  is  (strictly)  decreasing,  as  one  intuitively  expects.  Alternative  proofs  of 
these  and  other  results  can  be  found  in  the  appendix  of  Schervish  (1989). 

Schervish  (1989,  p.  1861)  suggested  that  his  Theorem  4.2  generalizes  the  Savage  repre¬ 
sentation.  Given  Savage’s  (1971,  p.  793)  assessment  of  his  representation  (9.15)  as  “figu¬ 
rative,”  the  claim  can  well  be  justified.  However,  in  the  rigorous  form  of  Theorem  3.2  the 
representation  of  Savage  applies  to  a  larger  class  of  scoring  rules  than  that  of  Schervish. 

Theorem  3.4  (Schervish)  Suppose  S  is  a  regular  searing  rule.  Then  S  is  proper  and 
sueh  that  5(0, 1)  =  limp^o  S{p,  1),  5(0, 0)  =  limp^o  S{p,  0)  and  both  S{p,  1)  and  5(p,  0)  are 
left  eontinuous  if  and  only  if  there  exists  a  measure  v  on  (0, 1)  sueh  that 

S{p,l)  =  S{l,l)  -  f  {l-q)n{dq),  5(p,  0)  =  5(0, 0)  -  /  qn{dq)  (11) 

J[p,l)  J[0,p) 

for  all  p  G  [0, 1].  The  searing  rule  is  strietly  proper  if  and  only  if  v  assigns  positive  measure 
to  every  open  interval. 

Proof.  Suppose  5  satisfies  the  assumptions  of  the  theorem.  To  prove  that  5(p,  1)  is  of 
the  form  (11)  consider  the  representation  (10),  identify  the  increasing  function  G'{p)  with 
the  left  continuous  distribution  function  of  a  measure  v  on  (0,1),  and  apply  the  partial 
integration  formula.  The  proof  of  the  representation  for  5(p,  0)  is  analogous.  For  the 
proof  of  the  converse  and  the  statement  for  strict  propriety  we  refer  to  Schervish  (1989, 
pp.  1876-1877).  ■ 

Schervish  (1989)  proposed  a  general  method  for  comparing  binary  forecasters  within 
the  framework  of  two-decision  problems.  A  two-decision  problem  can  be  characterized  by 
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a  cost-loss  ratio  q  ^  [0,1]  that  reflects  the  relative  costs  of  the  two  possible  types  of  inferior 
decision.  The  measure  ^{dq)  in  the  representation  (11)  assigns  relevance  to  distinct  cost-loss 
ratios.  For  instance,  the  Brier  score  corresponds  to  a  uniform  measure,  and  the  logarithmic 
score  corresponds  to  the  infinite  measure  with  Lebesgue  density  (g(l  —  q))~^-  Buja  et 
al.  (2004)  took  this  approach  a  major  step  further.  They  suggested  a  taxonomy  of  scores 
and  introduced  a  parametric  family  of  strictly  proper  scoring  rules  that  includes  the  Brier 
score,  the  logarithmic  score  and  the  aforementioned  scoring  rule  that  underlies  boosting. 

4  Scoring  rules  for  continuous  variables 

Bremnes  (2004,  p.  346)  noted  that  the  literature  on  scoring  rules  for  probabilistic  forecasts 
of  continuous  variables  is  sparse.  We  address  this  issue  in  the  following. 

4.1  Scoring  rules  for  density  forecasts 

Let  ^  be  a  a-finite  measure  on  the  measurable  space  (11,^).  For  a  >  1,  let  denote  the 
class  of  probability  measures  on  (11,  yl)  that  are  absolutely  continuous  with  respect  to  /x 
and  have  ;U-density  p  such  that 

\\p\\a  =  (^J 

is  finite.  We  identify  a  probabilistic  forecast  P  £  La  with  its  ;U-density  p  and  call  p  a 
predictive  density  or  density  forecast.  Predictive  densities  are  defined  only  up  to  a  set  of  p- 
measure  zero.  Whenever  appropriate,  we  follow  Bernardo  (1979,  p.  689)  and  use  the  unique 
version  defined  by  p{uj)  =  \\m.p^QP[Sp{uj))/ p{Sp{uj))  where  Sp{uj)  is  a  sphere  of  radius  p 
centered  at  lo. 

Good  (1971)  proposed  the  pseudospherical  score, 

PsendoS(p,^)=^(^(04)  (12) 

that  reduces  to  the  spherical  score  when  a  =  2.  He  described  original  and  generalized 
versions  of  the  score  —  a  distinction  that  in  a  measure-theoretic  framework  is  obsolete. 
The  pseudospherical  score  is  strictly  proper  relative  to  La,  as  noted  by  Good,  and  the 
representation  (4)  holds  where 

0(/>)  =  ^(ll/>ll«-l)- 

The  strict  convexity  of  G  and  the  associated  subtangent  inequality  follow  from  the  Holder 
and  Minkowski  inequalities,  respectively.  Taking  the  limit  as  a  ^  1  in  (12)  yields  the 
logarithmic  score,  LogS(p,ti;)  =  log  ^(u;),  which  was  proposed  by  Good  (1952)  and  has 
been  widely  used  since.  Roulston  and  Smith  (2002)  gave  an  information  theoretic  perspec¬ 
tive  and  an  interpretation  in  terms  of  gambling  returns.  The  logarithmic  score  is  strictly 


proper  relative  to  the  convex  class  £1  of  the  probability  measures  that  are  dominated  by 
fj,.  The  associated  expected  score  function  is  negative  entropy.  Well-known  properties  of 
the  Kullback-Leibler  divergence  imply  its  strict  convexity  and  the  associated  subtangent 
inequality.  Bernardo  (1979,  p.  689)  argued  that  “when  assessing  the  worthiness  of  a  scien¬ 
tist’s  final  conclusions,  only  the  probability  he  attaches  to  a  small  interval  containing  the 
true  value  should  be  taken  into  account.”  This  seems  subject  to  debate,  and  atmospheric 
scientists  have  argued  otherwise,  putting  forth  scoring  rules  that  are  sensitive  to  distanee 
(Epstein  1969,  Stael  von  Holstein  1970).  That  said,  Bernardo  (1979)  studied  heal  scoring 
rules  S{p,uj)  that  depend  on  the  predictive  density  p  only  through  its  value  at  the  event 
Lv  that  materializes.  Assuming  regularity  conditions,  he  showed  that  every  proper  local 
scoring  rule  is  of  the  form  S{p,uj)  =  alogp(w)  -|-  f{uj)  for  some  constant  a  >  0  and  some 
function  /. 

Consequently,  the  linear  seore,  LinS(p,ti;)  =  p{u!),  is  not  a  proper  scoring  rule,  despite 
its  intuitive  appeal.  For  instance,  if  (H,A)  =  (M,H),  where  B  is  the  cr-algebra  of  Borel 
sets  and  p  is  Lebesgue  measure,  let  cf)  and  denote  the  density  of  the  standard  normal 
distribution  and  the  uniform  distribution  on  (— e,  e),  respectively.  If  e  <  ^log  2  then 

LinS(u„  (t>)  =  Ye  /_,  ^ 

in  violation  of  propriety.  Essentially,  the  linear  score  encourages  overprediction  at  the 
modes  of  an  assessor’s  true  predictive  density.  Alternatives  to  the  linear  score  that  are 
strictly  proper  relative  to  the  class  £2  include  the  spherical  score,  defined  above,  and  the 
quadratie  seore, 

QS{p,uj)  =  2p{uj)  -  J {p{-)f  dp{-), 

which  has  expected  score  function  G{p)  =  ||p||2  and  corresponds  to  the  Brier  score  when 
H  is  finite.  The  probability  score  of  Wilson,  Burrows  and  Lanzinger  (1999)  integrates  the 
predictive  density  over  a  neighborhood  of  the  observed,  real-valued  quantity.  This  resembles 
the  linear  score  and  is  not  a  proper  score  either. 

4.2  Continuous  ranked  probability  score 

The  restriction  to  absolutely  continuous  predictive  distributions  is  frequently  impractical. 
Probabilistic  quantitative  precipitation  forecasts,  for  instance,  involve  distributions  with 
a  point  mass  at  zero  (Krzysztofowicz  and  Sigrest  1999;  Bremnes  2004).  This  could  be 
handled  by  considering  densities  with  respect  to  a  mixed  dominating  measure  rather  than 
Lebesgue  measure,  but  the  resulting  scores  seem  difficult  to  interpret.  Furthermore,  the 
scores  discussed  in  Section  4.1  are  not  sensitive  to  distance,  meaning  that  no  credit  is  given 
for  assigning  high  probabilities  to  values  near  but  not  identical  to  the  one  materializing. 
Sensitivity  to  distance  seems  particularly  desirable  when  the  predictive  distribution  is  mul¬ 
timodal. 

To  address  this  situation,  let  V  consist  of  all  probability  measures  on  (H,yl)  =  (M,H). 
We  identify  a  probabilistic  forecast,  that  is,  a  member  of  the  class  V,  with  its  cumulative 
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distribution  function  F,  and  we  use  standard  notation  for  the  elements  of  the  sample  space 
M.  Let  l{y  >  x}  denote  the  function  that  attains  the  value  1  'd  y  >  x  and  the  value  0 
otherwise.  The  continuous  ranked  probability  score  is  defined  as 

/OO 

(F(y)  -  l{y  >  x}f  dy  (13) 

-OO 


and  corresponds  to  the  integral  of  the  Brier  scores  for  the  associated  binary  probabilistic 
forecasts  at  all  real- valued  thresholds  (Matheson  and  Winkler  1976;  Hersbach  2000).  Appli¬ 
cations  of  the  continuous  ranked  probability  score  have  been  hampered  by  a  lack  of  analytic 
expressions,  and  the  use  of  numerical  quadrature  rules  for  the  evaluation  of  (13)  has  been 
proposed  instead  (Stael  von  Holstein  1977;  Unger  1985).  By  Proposition  1  of  Sz&ely  (2003), 

CRPS(F,x)  =  ^Ef\X  -X'\  -Ef\X  -x\,  (14) 

where  X  and  X'  are  independent  copies  of  a  random  variable  with  distribution  function  F 
and  finite  first  moment.  When  the  predictive  distribution  is  normal,  it  follows 

readily  from  (13)  or  (14)  that 


CRPS(A7(/i,u2),x)  = 


X  —  pi 

a 


erf 


X- 


\/2  exp 


2a^  ))’ 


where  erf  denotes  the  error  function  (Gneiting  et  al.  2004).  Similarly,  analytical  expressions 
can  be  derived  for  many  other  distributions.  If  a  closed  form  expression  is  not  available 
but  random  numbers  with  distribution  F  can  be  generated,  the  right-hand  side  of  (14) 
can  be  evaluated  by  Monte  Carlo  techniques.  The  continuous  ranked  probability  score  is 
proper  but  not  strictly  proper  relative  to  "P.  It  is  strictly  proper  relative  to  the  class  Pi  of 
probability  measures  on  (M,H)  that  have  finite  first  moment. 

The  continuous  ranked  probability  score  has  lately  found  renewed  interest  in  the  atmo¬ 
spheric  sciences  community  (Hersbach  2000;  Gneiting  et  al.  2004).  It  is  typically  used  in 
negative  orientation,  say  CRPS*(P, x)  =  —  CRPS(P, x).  The  representation  (14)  can  then 
be  written  as 

CRPS*(P,x)  =Ef\X  -x\  -^Ef\X  -  X'\, 

which  sheds  new  light  on  the  score.  In  negative  orientation,  the  continuous  ranked  probabil¬ 
ity  score  can  be  reported  in  the  same  unit  as  the  observations,  and  it  generalizes  the  absolute 
error  to  which  it  reduces  if  P  is  a  deterministic  forecast  —  that  is,  a  point  measure.  Thus 
the  continuous  ranked  probability  score  provides  a  direct  way  of  comparing  deterministic 
and  probabilistic  forecasts. 

Matheson  and  Winkler  (1976)  proposed  a  univariate  generalization  of  the  continuous 
ranked  probability  score.  We  generalize  further  and  propose  a  multivariate  version  thereof. 
If  P  denotes  the  class  of  probability  measures  on  (M”^,H”^),  we  identify  a  probabilistic 
forecast  P  G  V  with  its  cumulative  distribution  function  F.  The  multivariate  continuous 
ranked  probability  score  is  defined  as 


GRPS(P,  x) 


{F{y)  -  l{y  >  x}f  dW{y), 
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where  the  integral  is  taken  with  respect  to  some  measure  W  on  This  is  a  weighted 

integral  of  the  Brier  scores  at  all  m- variate  thresholds,  and  the  measure  W  can  be  chosen 
to  encourage  the  forecaster  to  concentrate  her  efforts  on  the  important  ones.  In  analogy  to 
Eq.  (24)  of  Matheson  and  Winkler  (1976), 

CRPS(F,Fo)  =  (m  -  ^o(y)f  dW{y)  -  F{y)  (1  -  F{y))  dW{y). 

The  multivariate  continuous  ranked  probability  score  is  proper  relative  to  "P.  If  IT  is  a  finite 
measure  that  dominates  Lebesgue  measure,  the  expected  score  function  is  real-valued  and 
strictly  convex,  and  the  score  is  strictly  proper  relative  to  V. 


4.3  Energy  score 

This  section  introduces  another  generalization  of  the  continuous  ranked  probability  score 
that  draws  on  Sz&ely’s  (2003)  statistical  energy  perspective.  Let  Pq,,  a  G  (0,2),  denote 
the  class  of  probability  measures  P  on  (yi,A)  =  (M™,P™)  which  are  such  that  Pp||X||“  is 
finite,  where  ||  •  ||  denotes  the  Euclidean  norm.  We  define  the  energy  seore 

ES(P,x)  =  ipp||X-X'||“-Pp||X-x||",  (15) 

where  X  and  X'  are  independent  copies  of  an  m-variate  random  vector  with  distribution  P. 
This  generalizes  the  univariate  continuous  ranked  probability  score,  to  which  (15)  reduces 
when  a  =  1  and  m  =  1,  by  allowing  for  an  index  a  G  (0, 2)  and  by  applying  to  distributional 
forecasts  of  a  vector-valued  quantity.  The  evaluation  of  (15)  is  straightforward  using  Monte 
Carlo  techniques.  To  prove  that  the  energy  score  is  strictly  proper  relative  to  Pq,,  recall 
Theorem  2.1  and  note  from  Theorem  1  of  Sz&ely  (2003)  that  ES(P,  x)  is  a  subtangent  of 

G{P)  =  -^Ep\\X  -  X^f  , 


which  is  a  strictly  convex  function  on  Pq,.  The  score  has  the  potentially  desirable  property 
of  invariance  under  joint  translation  and/or  rotation  of  P  and  x.  In  negative  orientation, 
it  can  be  interpreted  as  a  generalization  of  the  absolute  error  of  order  a. 

The  energy  score  applies  to  the  class  Pq  of  all  probability  measures  on  by 

defining 


ES(P,  x)  =  - 


a2 


0-2 


r(f  +  f)  f  \^{y)-F(-^y)\ 


\m-\-Q 


dy, 


(16) 


W2r(i-f)  /m™  llyll 

where  (p  denotes  the  characteristic  function  of  P.  Essentially,  the  score  computes  a  weighted 
distance  between  the  characteristic  function  of  P  and  the  characteristic  function  of  the 
point  measure  at  the  value  that  materializes.  This  is  akin  to  the  metric  studied  by  Eaton, 
Giovagnoli  and  Sebastiani  (1996,  p.  124).  If  P  belongs  to  Pq,  Theorem  1  of  Szekely  (2003) 
implies  the  equality  of  the  right-hand  sides  in  Eqs.  (15)  and  (16),  respectively.  Relative  to 
the  full  class  Pq,  the  energy  score  is  proper  but  not  strictly  proper. 
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4.4  Predictive  model  choice  criterion 

The  predictive  model  choice  criterion  of  Laud  and  Ibrahim  (1995)  and  Gelfand  and  Ghosh 
(1998)  has  lately  attracted  the  attention  of  the  statistical  community.  Suppose  that  we 
fit  a  predictive  model  to  observed  data  xi,. . .  ,Xn-  The  predictive  model  choice  criterion 
(PMGG)  assesses  the  model  fit  through  the  quantity 

n  n 

VMCC  =  Y,{,x,-x,f  +  Y,al 

i=l  i=l 

where  and  erf  denote  the  expected  value  and  the  variance,  respectively,  of  a  replicate 
variable  Xi,  given  the  model  and  the  observations.  Within  the  framework  of  scoring  rules, 
the  PMGG  corresponds  to  the  positively  oriented  score 

S{P,x)  =  -  {EpX  -  xf  -  Varp(X), 

where  X  is  a  random  variable  with  distribution  P,  which  is  not  a  proper  scoring  rule:  If  the 
forecaster’s  true  belief  is  P  and  if  she  wishes  to  maximize  the  expected  score,  she  will  quote 
the  point  measure  at  EpX  —  that  is,  a  deterministic  forecast  —  rather  than  the  predictive 
distribution  P. 

5  Scoring  rules  for  quantile  and  interval  forecasts 

Occasionally,  full  predictive  distributions  are  difficult  to  specify,  and  the  forecaster  might 
quote  predictive  quantiles  or  prediction  intervals  instead.  Bremnes  (2004)  gave  an  example 
of  this  type  of  situation.  That  said,  specifying  a  predictive  distribution  is  equivalent  to 
specifying  all  predictive  quantiles;  and  we  can  build  scoring  rules  for  predictive  distributions 
from  scoring  rules  for  quantiles.  Matheson  and  Winkler  (1976)  and  Gervera  and  Munoz 
(1996)  suggested  ways  of  doing  this.  For  instance,  we  might  sum  the  interval  score  (19), 
introduced  below,  over  prediction  intervals  with  equidistant  or  representative  probability 
content. 

5.1  Proper  scoring  rules  for  quantiles 

We  consider  probabilistic  forecasts  of  a  continuous  quantity  that  take  the  form  of  predictive 
quantiles.  Specifically,  suppose  that  the  quantiles  at  the  levels  ai, . . .  ,ak  G  (0, 1)  are  sought. 
If  the  forecaster  quotes  the  quantiles  ri, . . . ,  and  x  materializes,  she  will  be  rewarded  by 
the  score  ^(ri, . . . ,  x).  We  define 

S{ri,...,rk;P)  =  J  S’(ri, . . . ,  r^;  x)  dP(x) 

as  the  expected  score  under  the  probability  measure  P  when  the  forecaster  quotes  the 
quantiles  ri, . . .  ,rfc.  To  avoid  technical  complications,  we  suppose  that  P  belongs  to  the 
convex  class  V  of  probability  measures  on  (M,  B)  that  have  finite  moments  of  all  orders  and 
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whose  distribution  function  is  strictly  increasing  on  M.  For  P  G  "P,  let  (71, . . . ,  denote  the 
true  P-quantiles  at  levels  ai, . . .  ,0^.  Following  Cervera  and  Munoz  (1996),  we  say  that  a 
scoring  rule  S  is  proper  if 


S{qi,...,qk;P)  >  S(ri,...,rfc;P) 

for  all  real  numbers  ri, . . .  ,rfc  and  for  all  probability  measures  P  G  V.  If  S'  is  proper,  the 
forecaster  who  wishes  to  maximize  the  expected  score  is  encouraged  to  be  honest  and  to 
volunteer  her  true  beliefs. 

To  avoid  technical  overhead,  we  tacitly  assume  P-integrability  whenever  appropriate. 
Essentially,  we  require  that  the  functions  s{x)  and  h(x)  in  (17)  grow  at  most  polynomially 
in  X.  We  write  l{x  <  r}  for  the  function  that  attains  the  value  1  if  x  <  r  and  the  value  0 
otherwise.  Theorem  5.1  addresses  the  prediction  of  a  single  quantile;  Corollary  5.2  turns  to 
the  general  case. 

Theorem  5.1  If  s  is  nondecreasing  and  h  is  arbitrary,  the  scoring  rule 

S{r]  x)  =  as(r)  +  (s(x)  —  s(r))  l{x  <  r}  +  h{x)  (17) 

is  proper  for  predicting  the  quantile  at  level  a  G  (0, 1). 

Proof.  Let  q  be  the  unique  a-quantile  of  the  probability  measure  P  G  P.  We  identify  P 
with  the  associated  distribution  function  so  that  P{q)  =  a.  If  r  <  g  then 

S{q,  P)  —  S{r,  P)  =  /  s(x)  dP(x)  +  s(r)P(r)  —  Q;s(r) 
pr,q) 

>  s{r){P{q)  —  P{r))  +  s{r)P{r)  —  as{r)  =  0, 

as  desired.  If  r  >  g  an  analogous  argument  applies.  ■ 

Corollary  5.2  If  Si  is  nondecreasing  for  i  =  1, . . .  ,k  and  h  is  arbitrary  the  scoring  rule 

k 

5(ri, . . . ,  rfc;  x)  =  ^  (aiSi(r)  +  {si{x)  -  Si{ri))  l{x  <  rj})  +  h{x)  (18) 

i=l 

is  proper  for  predicting  the  quantiles  at  levels  oi, . . .  ,ak  G  (0, 1). 

Cervera  and  Munoz  (1996,  pp.  515  and  519)  proved  Corollary  5.2  in  the  special  case  in 
which  each  Sj  is  linear.  They  asked  whether  the  resulting  rules  are  the  only  proper  ones  for 
quantiles.  Our  results  give  a  negative  answer;  that  is,  the  class  of  proper  scoring  rules  for 
quantiles  is  considerably  larger  than  anticipated  by  Cervera  and  Munoz.  We  do  not  know 
whether  or  not  (17)  and  (18),  respectively,  provide  the  general  form  of  proper  scoring  rules 
for  quantiles. 
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5.2  Interval  score 

Interval  forecasts  form  a  crucial  special  case  of  quantile  prediction.  We  consider  the  classical 
case  of  the  central  (1  —  a)  x  100%  prediction  interval,  whose  lower  and  upper  endpoints 
are  given  by  the  predictive  quantile  at  level  j  and  1  —  f  •  We  denote  a  scoring  rule  for  the 
associated  interval  forecast  by  Sa{l,u-,x),  where  I  and  u  stand  for  the  quoted  ^  and  1  —  f 
quantile,  respectively.  Hence,  if  the  forecaster  quotes  the  (1  —  a)  x  100%  central  prediction 
interval  [l,u\  and  x  materializes,  her  score  will  be  Sa{l,u]x).  Putting  ai  =  j,  a2  =  1  —  j, 
si(x)  =  S2(x)  =  4x  and  h(x)  =  —2x  in  (18)  yields  the  interval  score, 

(  —  2a{u  —  1)  —  4(1  —  x)  if  x<l, 

Sa{l,u]x)  =  l  —2a{u  —  l)  if  I  <  X  <  u,  (19) 

[  —  2a{u  —  1)  —  4{x  —  u)  if  x  >  u. 

This  scoring  rule  has  intuitive  appeal  and  —  in  the  form  of  a  utility  function  —  can  be 
traced  back  at  least  to  Dunsmore  (1968)  and  Winkler  (1972).  The  forecaster  is  rewarded  for 
narrow  prediction  intervals,  and  she  avoids  a  penalty  if  the  interval  covers  the  observation. 
In  the  particular  case  a  =  0.50,  Hamill  and  Wilks  (1995,  p.  622)  used  a  score  that  is 
negatively  oriented  but  equivalent  to  the  interval  score.  They  noted  that  “a  strategy  for 
gaming  [. . .  ]  was  not  obvious”  which  is  confirmed  by  the  propriety  of  the  score. 

5.3  Prediction  intervals  for  a  conditionally  heteroscedastic  process 

Kabaila  (1999)  called  for  rigorous  ways  of  specifying  prediction  intervals  for  conditionally 
heteroscedastic  processes  and  proposed  a  relevance  criterion  in  terms  of  conditional  coverage 
and  width  dependence.  We  contend  that  the  notion  of  proper  scoring  rules  provides  a 
simpler,  more  general  and  more  rigorous  paradigm.  The  prediction  intervals  that  we  deem 
appropriate  derive  from  the  true  conditional  distribution,  as  implied  by  the  data  generating 
mechanism,  and  thereby  maximize  the  expected  value  of  all  proper  scores. 

To  fix  the  idea,  consider  the  stationary  bilinear  process  {Xt  :  t  G  Z}  defined  by 

=  \xt-i  +  \xt-iet  +  et  (20) 

where  the  et  are  independent  standard  normal  random  variates.  Kabaila  and  He  (2001) 
studied  central  one-step  ahead  prediction  intervals  at  the  95%  level.  The  process  is  Marko¬ 
vian,  and  the  conditional  distribution  of  Xt+i  given  Xt,Xt-i, . . .  is  Gaussian  with  mean 
and  variance  (1  -|-  thereby  suggesting  the  prediction  interval 

I  =  ij^Xt  —  c  1  +  -Xt  ,  -Xt  -|-  c  1  -|-  -Xt  ,  (21) 

where  c  =  d>“^(0.975).  This  interval  satisfies  the  relevance  property  of  Kabaila  (1999), 
and  Kabaila  and  He  (2001)  adopted  I  as  the  standard  prediction  interval.  We  agree  with 
this  choice,  but  we  prefer  the  aforementioned  more  direct  justification:  the  prediction  in¬ 
terval  I  is  the  standard  interval  because  its  lower  and  upper  endpoints  are  the  2.5%  and 
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Table  1:  One-step  ahead  95%  prediction  intervals  for  the  stationary  bilinear  process  (20). 
Results  of  a  simulation  study  using  100,000  interval  forecasts  each. 


Interval 

Forecast 

Empirical 

Coverage 

Average 

Width 

Average 
Interval  Score 

I 

(21) 

95.01% 

4.00 

-0.48 

J 

(22) 

95.08% 

5.45 

-0.79 

K 

(23) 

94.98% 

3.79 

-0.52 

97.5%  percentiles  of  the  true  conditional  distribution  function,  respectively.  Kabaila  and 
He  considered  two  alternative  prediction  intervals,  namely 

J  =  rF“^(0.025),F“^  (0.975)1  ,  (22) 

where  F  denotes  the  unconditional,  stationary  distribution  function  of  the  Xt,  and 

K  =  [iXi  -  7(|l  +  +  +  ix*!)]  ,  (23) 

where 

(21og  (If  ))■''%  if  0<9<7,36. 

[  0  if  2/  >  7.36. 

This  choice  of  7  minimizes  the  expected  width  of  the  prediction  interval  under  the  constraint 
of  nominal  coverage.  However,  the  interval  forecast  (23)  seems  misguided;  it  collapses  to  a 
point  forecast  when  the  conditional  predictive  variance  is  highest. 

We  generated  a  sample  path  of  length  100,001  from  the  bilinear  process  (20)  and  consid¬ 
ered  the  interval  forecasts  (21),  (22)  and  (23),  respectively.  Table  1  summarizes  the  results 
of  this  experiment.  All  three  interval  forecasts  showed  close  to  nominal  coverage,  and  the 
prediction  interval  (23)  showed  the  smallest  average  width.  Nevertheless,  the  classical  pre¬ 
diction  interval  (21)  showed  the  largest  value  of  the  interval  score. 

6  Scoring  rules,  Bayes  factors  and  random-fold  cross-validation 

6.1  Proper  scoring  rules  and  Bayes  factors 

Probabilistic  forecasting  rules  are  often  generated  by  probabilistic  models,  and  the  standard 
Bayesian  approach  to  comparing  probabilistic  models  is  by  Bayes  factors.  Suppose  we  have  a 
sample  X  =  (Xi, . . . ,  X„)  of  values  to  be  forecast.  Suppose  also  that  we  have  two  forecasting 
rules,  based  on  probabilistic  models  Hi  and  H2.  So  far  in  this  paper  we  have  concentrated 
on  the  situation  where  the  forecasting  rule  is  completely  specified  before  any  of  the  Xj  is 
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observed,  i.e.  there  are  no  parameters  to  be  estimated  from  the  data  being  forecast.  In  that 
situation,  the  Bayes  factor  for  Hi  against  H2  is 

_  P{X\Hi) 

P{X\H2)^ 

where  =  0^=1  =  1)2)  (Jeffreys  1939;  Kass  and  Raftery  1995). 

Thus  if  the  log  score  is  used,  the  log  Bayes  factor  is  the  difference  of  the  scores  for  the 
two  models, 

log(R)  =  LogS{Hi,X)-LogS{H2,X).  (25) 

This  was  pointed  out  by  Good  (1952),  who  called  the  log  Bayes  factor  the  weight  of  evidence. 
It  establishes  two  connections.  First,  the  Bayes  factor  is  equivalent  to  the  log  score  in  this 
“no  parameter”  case.  Second,  it  shows  that  the  Bayes  factor  applies  more  generally  than 
just  to  the  comparison  of  parametric  probabilistic  models,  but  also  to  the  comparison  of 
probabilistic  forecasting  rules  of  any  kind. 

So  far  in  this  paper  we  have  taken  probabilistic  forecasts  to  be  fully  specified,  but  often 
they  are  specified  only  up  to  unknown  parameters  estimated  from  the  data.  Now  suppose 
that  the  forecasting  rules  considered  are  specified  only  up  to  unknown  parameters.  Ok  for 
Hk,  to  be  estimated  from  the  data.  Then  the  Bayes  factor  is  still  given  by  (24),  but  now 
P{X\Hk)  is  the  integrated  likelihood, 

p{x\Hk)  =  J p{x\ek,Hk)p{ek\Hk)  dOk, 

where  p{X\9k,  Hk)  is  the  (usual)  likelihood  under  model  Hk  and  p{6k\Hk)  is  the  prior  dis¬ 
tribution  of  the  parameter  Ok- 

Dawid  (1984)  showed  that  when  the  data  come  in  a  particular  order,  such  as  time  order, 
the  integrated  likelihood  can  be  reformulated  in  predictive  terms: 

n 

P{X\Hk)  =  l[P{Xt\X^-\Hk), 

t=i 

where  =  {Xi, . . . ,  Xt-i},  and  P{Xt\X^~^ ,  Hk)  is  the  predictive  distribution  of  Xt 

given  the  past  values  under  Hk,  namely 

p{Xt\x^-\Hk)  =  J  p{Xt\ek,Hk)P{ek\x^-\Hk)  d^, 

with  P{6k\X^~^ ,  Hk)  being  the  posterior  distribution  of  Ok  given  the  past  observations  X^~^. 

Let  us  denote  by  Sk,B  the  log  integrated  likelihood,  viewed  now  as  a  scoring  rule.  It 
helps  to  view  it  as  a  scoring  rule  to  rewrite  it  as 

n 

Sk,B  =  Y.^ogP{Xt\X^-\Hk). 

t=i 


(26) 


(24) 
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This  is  a  proper  score  with  respect  to  the  Bayesian  predictive  density 

Qk{X)  =  I p{X\ek,Hk)p{9k\Hk)  dOk. 

Dawid  (1984)  showed  that  Sk,B  is  an  approximation  to  the  plug-in  maximum  likelihood 
prequential  score 

n 

Sk,D  =  J2^ogP{Xt\X^-\ei-^),  (27) 

t=l 

where  is  the  maximum  likelihood  estimator  (MLE)  of  9^  based  on  the  past  observations, 
X^~^,  in  the  sense  that  Sk,D/Sk,B  ^  1  as  n  ^  oo.  He  also  showed  that  Sk,B  is  an 
approximation  to  the  BIG  score, 

n  1 

Sk,Bic  =  Y.^ogP{Xt\X^-\9l)  -  ^logn, 

t=i  ^ 

where  dk  =  dim(0fc),  in  the  same  sense,  namely  Sk, bic/ Sk,B  ^  1  as  n  ^  oo.  This  justi¬ 
fies  the  use  of  BIG  for  comparing  forecasting  rules,  extending  the  previous  justification  of 
Schwarz  (1978),  which  related  only  to  comparing  models. 

These  results  have  two  limitations,  however.  First,  they  assume  that  the  data  come  in 
a  particular  order.  Second,  they  use  only  the  log  score,  and  not  other  scores  that  might 
be  more  appropriate  for  the  task  at  hand.  We  now  briefly  consider  how  these  limitations 
might  be  addressed. 

6.2  Scoring  rules  and  random- fold  cross-validation 

Suppose  now  that  the  data  are  unordered.  Then  equation  (26)  holds  for  any  ordering  of 
the  data.  We  can  replace  (26)  by 

n 

Sk,B  =  SIb  =  Y.ED[\ogp{Xt\X^^\Hk],  (28) 

t=i 

where  H  is  a  random  sample  from  {1, ...  —  +  ...  ,n},  whose  size  is  a  random  variable 

that  has  a  discrete  uniform  distribution  on  {0, 1, . . .  ,n  —  1}.  Dawid’s  result  (27)  implies 
that  this  is  asymptotically  equivalent  to  the  plug-in  maximum  likelihood  version, 

n 

SId  =  J2ED[logp{Xt\X^^\9i^\Hk],  (29) 

t=i 

where  is  the  MLE  of  9k  based  on  X^^\ 

The  formulations  (28)  and  (29)  may  be  useful  because  they  turn  a  score  that  was  a 
sum  of  non-identically  distributed  terms  into  one  that  is  a  sum  of  identically  distributed 
exchangeable  terms.  This  opens  the  possibility  of  evaluating  5^  b  D  Monte  Garlo, 
which  would  be  a  form  of  cross-validation.  In  this  cross-validation,  the  amount  of  data  left 
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out  would  be  random  rather  than  fixed,  leading  us  to  call  it  random-fold  cross-validation. 
Smyth  (2000)  used  the  log-likelihood  as  the  criterion  function  in  cross-validation,  as  here, 
calling  the  resulting  method  cross- validated  likelihood,  but  used  a  fixed  holdout  sample  size. 

These  results  suggest  that  random-fold  cross-validation  corresponds  to  a  proper  scoring 
rule,  asymptotically  as  the  amount  of  simulation  gets  large.  One  issue  in  cross-validation 
generally  is  how  much  data  to  leave  out,  and  different  choices  lead  to  different  versions  of 
cross-validation,  such  as  leave-one-out,  10-fold,  and  so  on.  Considering  versions  of  cross- 
validation  as  scoring  rules  may  shed  some  light  on  this  issue,  for  example  by  determining 
whether  or  not  they  are  proper.  We  are  not  aware  of  results  showing  that  other  versions  of 
cross-validation  correspond  to  proper  scoring  rules. 

We  have  seen  by  (25)  that  when  there  are  no  parameters  being  estimated,  the  Bayes 
factor  is  equivalent  to  the  log  score.  Thus  one  could  replace  the  log  score  by  another  proper 
score,  and  the  difference  in  scores  could  be  viewed  as  a  kind  of  “predictive  Bayes  factor” 
with  a  non-log  score.  In  Sk,Bi  Sk^BiCi  could  replace  the  terms 

in  the  sums  (each  of  which  has  the  form  of  a  log  score)  by  another  proper  score,  such  as 
the  continuous  ranked  probability  score,  and  we  conjecture  that  the  resulting  scores  are 
also  proper.  Then  we  would  have  a  way  of  generating  proper  scoring  rules  when  there  are 
parameters  being  estimated. 

7  Case  study:  Probabilistic  forecasts  of  sea-level  pressure 
over  the  North  American  Pacific  Northwest 

Operational  probabilistic  weather  forecasts  are  based  on  ensemble  prediction  systems.  En¬ 
semble  systems  typically  generate  a  set  of  perturbations  of  the  best  estimate  of  the  current 
state  of  the  atmosphere,  run  each  of  them  forward  in  time  using  a  numerical  weather  predic¬ 
tion  model,  and  use  the  resulting  set  of  forecasts  as  a  sample  from  the  predictive  distribution 
of  future  weather  quantities  (Palmer  2002). 

Grimit  and  Mass  (2002)  described  the  University  of  Washington  ensemble  prediction 
system  over  the  Pacific  Northwest  which  covers  Oregon,  Washington,  British  Columbia, 
and  parts  of  the  Pacific  Ocean.  This  is  a  five-member  ensemble  that  consists  of  distinct 
runs  of  the  MM5  numerical  weather  prediction  model  with  initial  conditions  taken  from 
distinct  national  and  international  weather  centers.  We  consider  48-hour  ahead  forecasts 
of  sea-level  pressure  in  January-June  2000,  the  same  period  as  that  on  which  the  work  of 
Grimit  and  Mass  was  based.  The  unit  used  is  the  millibar  (mb).  Our  analysis  builds  on  a 
verification  data  base  of  16  015  records  scattered  over  the  North  American  Pacific  North¬ 
west  and  the  aforementioned  six-month  period.  Each  record  consists  of  the  five  ensemble 
member  forecasts  and  the  associated  verifying  observation.  The  root-mean-square  error  of 
the  ensemble  mean  forecast  was  3.30  mb,  and  the  square  root  of  the  average  variance  of  the 
five-member  forecast  ensemble  was  2.13  mb,  resulting  in  a  ratio  of  1.55. 

The  underdispersive  behavior  —  observed  errors  that  tend  to  be  larger  on  average  than 
suggested  by  the  ensemble  spread  —  is  typical  of  ensemble  systems  and  seems  unavoidable, 
given  that  ensembles  capture  only  part  of  the  sources  of  uncertainty  (Raftery,  Balabdaoui, 
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Figure  1:  Probabilistic  sea- level  pressure  forecasts  over  the  North  American  Pacific  North¬ 
west  in  January-July  2000.  The  scores  are  shown  as  a  function  of  the  inflation  factor  r, 
where  the  predictive  density  is  taken  to  be  normal,  centered  at  the  ensemble  mean  fore¬ 
cast,  and  with  predictive  standard  deviation  equal  to  r  times  the  standard  deviation  of  the 
forecast  ensemble.  The  scores  were  subject  to  linear  transformations  as  detailed  in  Table  2. 


Gneiting  and  Polakowski  2003,  p.  4).  To  obtain  calibrated  predictive  probability  distri¬ 
butions,  it  thus  seems  necessary  to  carry  out  some  form  of  statistical  postprocessing.  One 
natural  approach  is  to  take  the  predictive  distribution  for  sea-level  pressure  at  any  given  site 
as  normal,  centered  at  the  ensemble  mean  forecast,  and  with  predictive  standard  deviation 
equal  to  r  times  the  standard  deviation  of  the  forecast  ensemble.  Density  forecasts  of  this 
type  were  proposed  by  Deque,  Royer  and  Stroe  (1994)  and  Wilks  (2002).  Following  Wilks, 
we  refer  to  r  as  an  inflation  factor. 

7.1  Evaluation  of  density  forecasts 

In  the  aforementioned  approach  the  predictive  density  is  Gaussian,  say  its  mean, 

fi,  is  the  ensemble  mean  forecast,  and  its  standard  deviation,  ra,  is  the  product  of  the 
inflation  factor,  r,  and  the  standard  deviation  of  the  five-member  forecast  ensemble,  a.  We 
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Table  2:  Probabilistic  sea-level  pressure  forecasts  over  the  North  American  Pacific  North¬ 
west  in  January-July  2000.  The  predictive  density  is  taken  to  be  normal,  centered  at 
the  ensemble  mean  forecast,  and  with  predictive  standard  deviation  equal  to  r  times  the 
standard  deviation  of  the  forecast  ensemble. 


Score 

arg  max,.  s(r) 
in  Eqn.  (30) 

Linear  Transformation 
in  Figure  1 

Logarithmic  score  (LogS) 

2.41 

s-h  13 

Spherical  score  (SphS) 

1.84 

108s  -h  86 

Quadratic  score  (QS) 

2.18 

40s  -|-  6 

Continuous  ranked  probability  score  (CRPS) 

1.62 

10s -h  8 

Linear  score  (LinS) 

105s  -  5 

Probability  score  (PS) 

60s  -  5 

considered  various  scoring  rules  S  and  computed  the  average  score, 

^  16  015 

■s(r)  =  16015  ,  r>0,  (30) 

i=l 

as  a  function  of  the  inflation  factor  r.  The  index  i  refers  to  the  i-th  record  in  the  verification 
data  base,  and  Xi  denotes  the  value  that  materialized.  Given  the  underdispersive  character 
of  the  ensemble  system,  we  expect  s(r)  to  be  maximized  at  some  r  >  1,  possibly  near  the 
observed  ratio  r  =  1.55  of  the  root-mean-square  error  of  the  ensemble  mean  forecast  over 
the  square  root  of  the  average  ensemble  variance. 

We  computed  the  mean  score  (30)  for  inflation  factors  r  G  (0, 5)  and  for  the  logarithmic 
score  (LogS),  spherical  score  (SphS),  quadratic  score  (QS),  continuous  ranked  probability 
score  (CRPS),  linear  score  (LinS),  and  probability  score  (PS),  as  defined  in  Sections  4.1  and 
4.2.  Briefly,  if  p  denotes  the  predictive  density  and  x  stands  for  the  observed  value,  then 


LogS(p,x) 

=  logp(x). 

SphS(p,  x) 

p{x) 

(roo(p(2/))^dy)V2  ’ 

QS(p,  x) 

=  "^Pix)  -  IZo{p{y)?  dy, 

CRPS(p,x) 

=  ^Ep\X  -  X'\-  Ep\X  -x 

LinS(p,  x) 

=  p{x), 

PS(p,  x) 

=  CTWd!,, 

Figure  1  and  Table  2  summarize  the  results  of  this  experiment.  The  scores  shown  in  the 
figure  are  linearly  transformed,  and  the  transformations  are  listed  in  the  right-hand  column 
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of  the  table.  In  the  case  of  the  spherical  score,  for  instance,  we  plotted  the  sum  of  108 
times  the  value  in  (30)  and  86.  Clearly,  propriety  is  preserved  under  the  transformation. 
The  logarithmic  score,  spherical  score,  quadratic  score,  and  continuous  ranked  probability 
score  were  maximized  at  values  of  r  that  were  larger  than  1,  thereby  confirming  the  un- 
derdispersive  character  of  the  ensemble.  These  scores  are  proper.  The  linear  score  and  the 
probability  score  were  maximized  at  r  =  0.05  and  r  =  0.02,  respectively,  thereby  suggesting 
ignorable  forecast  uncertainty  and  almost  deterministic  forecasts.  The  latter  two  scores 
have  intuitive  appeal,  and  the  probability  score  has  been  used  to  assess  forecast  ensembles 
(Wilson,  Burrows  and  Lanzinger  1999).  However,  they  are  improper  and  their  use  may 
result  in  misguided  scientific  inferences,  as  in  this  experiment. 

It  is  interesting  to  observe  that  the  logarithmic  score  gave  the  highest  maximizing  value 
of  r.  The  logarithmic  score  is  strictly  proper  but  involves  a  harsh  penalty  for  low  probability 
events  and  therefore  is  highly  sensitive  to  extreme  cases.  Our  verification  data  base  includes 
a  number  of  low  spread  cases  for  which  the  ensemble  variance  implodes.  The  logarithmic 
score  penalizes  the  resulting  predictions,  unless  the  inflation  factor  r  is  large.  Weigend  and 
Shi  (2000,  p.  382)  noted  similar  concerns  and  considered  the  use  of  trimmed  means  when 
computing  the  logarithmic  score.  In  our  experience,  the  continuous  ranked  probability  score 
is  less  sensitive  to  extreme  cases  or  outliers  and  provides  a  more  resistant  alternative. 

7.2  Evaluation  of  interval  forecasts 

The  aforementioned  predictive  densities  also  provide  interval  forecasts.  We  considered  the 
central  (1  —  a)  x  100%  prediction  interval  where  a  =  0.50  and  a.  =  0.10,  respectively.  The 
associated  lower  and  upper  prediction  bounds  li  and  Ui  are  the  §  and  1  —  f  quantiles  of 
the  normal  distribution  with  mean  Hi  and  standard  deviation  rcTj,  as  described  above.  We 
assessed  the  resulting  interval  forecasts  in  their  dependence  on  the  inflation  factor  r  in  two 
ways,  by  computing  the  empirical  coverage  of  the  prediction  intervals,  and  by  computing 

^  16  015 

Sa{r)  =  Sa{li,Ui;xi),  r  >  0,  (31) 

i=l 

where  Sa  denotes  the  interval  score  (19).  This  scoring  rule  assesses  both  calibration  and 
sharpness  —  the  latter  by  rewarding  narrow  prediction  intervals,  and  the  former  by  penal¬ 
izing  prediction  intervals  that  do  not  cover  the  observation.  Figure  2(a)  shows  the  empirical 
coverage  of  the  prediction  intervals.  Clearly,  the  coverage  increased  with  r.  If  a  =  0.50 
and  a  =  0.10  the  nominal  coverage  was  obtained  at  r  =  1.78  and  r  =  2.11,  respectively. 
This  confirms  the  underdispersive  character  of  the  ensemble.  Figure  2(b)  shows  the  interval 
score  (31)  as  a  function  of  the  inflation  factor  r.  If  a  =  0.50  and  a  =  0.10  the  score  was 
maximized  at  r  =  1.56  and  r  =  1.72,  respectively. 
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(a) 


Figure  2:  Interval  forecasts  of  sea- level  pressure  over  the  North  American  Pacific  Northwest 
in  January-July  2000:  (a)  Nominal  and  actual  coverage,  and  (b)  the  interval  score  (31),  for 
50%  central  prediction  intervals  (a  =  0.50,  broken  line)  and  90%  central  prediction  intervals 
(a  =  0.10,  solid  line).  The  predictive  density  is  Gaussian,  centered  at  the  ensemble  mean 
forecast,  and  with  predictive  standard  deviation  equal  to  r  times  the  standard  deviation  of 
the  forecast  ensemble. 
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8  Optimum  score  estimation 


Strictly  proper  scoring  rules  are  also  of  interest  in  estimation  problems,  where  they  provide 
attractive  loss  and  utility  functions  that  can  be  adapted  to  the  problem  at  hand. 


8.1  Point  estimation 


We  recall  the  generic  estimation  problem  described  in  the  introduction.  Suppose  that  we 
wish  to  fit  a  parametric  model  Pg  based  on  a  sample  Xi, . . . ,  of  identically  distributed 
observations.  To  estimate  6,  we  can  measure  the  goodness-of-fit  by  the  mean  score 


Sn{0) 


1 

n 


Y.S{Pe,X,) 

i=l 


where  S'  is  a  scoring  rule  that  is  (strictly)  proper  with  respect  to  a  convex  class  of  probability 
measures  that  contains  the  parametric  model.  If  9q  denotes  the  true  parameter  value, 
asymptotic  arguments  indicate  that 


aigmaxg  Sn{0)  ^  00  as  n^oo. 


(32) 


This  suggests  a  general  approach  to  estimation:  Choose  a  strictly  proper  scoring  rule  S  that 
is  tailored  to  the  scientific  problem  at  hand  and  take  =  aigmaxg  Sn{0)  as  the  optimum 
score  estimator  based  on  the  scoring  rule  S.  The  first  four  values  of  the  arg  max  in  Table 
2,  for  instance,  refer  to  the  optimum  score  estimate  for  the  inflation  factor  r  based  on 
the  logarithmic  score,  spherical  score,  quadratic  score  and  continuous  ranked  probability 
score,  respectively.  Pfanzagl  (1969)  and  Birge  and  Massart  (1993)  studied  optimum  score 
estimators  under  the  heading  of  minimum  contrast  estimators.  This  class  includes  many  of 
the  most  popular  estimators  in  various  situations  such  as  maximum  likelihood  estimators, 
least  squares  and  other  estimators  of  regression  models,  and  estimators  for  mixture  models 
or  deconvolution.  Pfanzagl  (1969)  proved  rigorous  versions  of  the  consistency  result  (32), 
and  Birge  and  Massart  (1993)  related  rates  of  convergence  to  the  entropy  structure  of 
the  parameter  space.  Maximum  likelihood  estimation  forms  the  special  case  of  optimum 
score  estimation  based  on  the  logarithmic  score,  and  optimum  score  estimation  forms  a 
special  case  of  M-estimation  (Huber  1964)  in  that  the  function  to  be  optimized  derives 
from  a  strictly  proper  scoring  rule.  When  estimating  the  location  parameter  in  a  normal 
population  with  known  variance,  for  example,  the  optimum  score  estimator  based  on  the 
continuous  ranked  probability  score  amounts  to  an  M-estimator  with  a  ^-function  of  the 
form  =  24>(^)  —  1,  where  c  is  a  positive  constant  and  ^  denotes  the  standard  normal 
cumulative.  This  provides  a  smooth  version  of  the  '0-function  for  Huber’s  (1964)  robust 
minimax  estimator;  see  Huber  (1981,  p.  208).  Asymptotic  results  for  M-estimators,  such  as 
the  consistency  theorems  of  Huber  (1967)  and  Perlman  (1972),  then  apply  to  optimum  scores 
estimators,  too.  Wald’s  (1949)  classical  proof  of  the  consistency  of  maximum  likelihood 
estimates  relies  heavily  on  the  strict  propriety  of  the  logarithmic  score,  which  is  proved  in 
his  Lemma  1. 


23 


The  appeal  of  optimum  score  estimation  lies  in  the  potential  adaption  of  the  scoring  rule 
to  the  problem  at  hand.  This  approach  has,  apparently,  only  very  recently  been  explored. 
Gneiting  et  al.  (2004)  estimated  a  predictive  regression  model  using  the  optimum  score  es¬ 
timator  based  on  the  continuous  ranked  probability  score  —  a  choice  that  was  motivated  by 
the  meteorological  problem  at  hand.  They  showed  empirically  that  such  an  approach  can 
yield  better  predictive  results  than  approaches  using  maximum  likelihood  plug-in  estimates. 
This  agrees  with  the  results  of  Copas  (1983)  and  Friedman  (1989)  who  showed  that  the  use 
of  maximum  likelihood  and  least  squares  plug-in  estimates  can  be  suboptimal  in  prediction 
problems.  Buja  et  al.  (2004)  proposed  the  use  of  strictly  proper  scoring  rules  in  classifica¬ 
tion  and  class  probability  estimation  problems  and  drew  links  to  Bayesian  techniques  and 
boosting. 

8.2  Interval  estimation 

We  now  turn  to  interval  estimation.  Casella,  Hwang  and  Robert  (1993,  p.  141)  pointed  out 
that 


“The  question  of  measuring  optimality  (either  frequentist  or  Bayesian)  of  a  set 
estimator  against  a  loss  criterion  combining  size  and  coverage  does  not  yet  have 
a  satisfactory  answer.” 

Their  work  was  motivated  by  an  apparent  paradox  due  to  J.  O.  Berger,  which  concerns 
interval  estimators  of  the  location  parameter  0  in  a  normal  population  with  unknown  scale. 
Let  !{•}  denote  an  indicator  function.  Under  the  loss  function 

L{I-e)=c\{I)-l{e  (33) 

where  c  is  a  positive  constant  and  A(/)  denotes  the  Lebesgue  measure  of  the  interval  estimate 
/,  the  classical  t-interval  is  dominated  by  a  misguided  interval  estimate  that  shrinks  to  the 
sample  mean  in  the  cases  of  the  highest  uncertainty.  Casella  et  al.  (1993,  p.  145)  commented 
that  “we  have  a  case  where  a  disconcerting  rule  dominates  a  time  honored  procedure.  The 
only  reasonable  conclusion  is  that  there  is  a  problem  with  the  loss  function.”  We  concur, 
and  we  propose  the  use  of  strictly  proper  scoring  rules  to  assess  interval  estimators  using  a 
loss  criterion  that  combines  width  and  coverage. 

Specifically,  we  contend  that  a  meaningful  comparison  of  interval  estimators  requires 
either  equal  coverage  or  equal  width.  The  loss  function  (33)  applies  to  all  set  estimates, 
regardless  of  coverage  and  size,  which  seems  unnecessarily  ambitious.  As  an  alternative,  we 
restrict  attention  to  interval  estimators  with  equal  nominal  coverage  and  use  the  (negative 
of  the)  interval  score  (19).  This  loss  function  can  be  written  as 

L«(/;0)  =  2aA(/) +4inf  |0-7?|,  (34) 

rj^I 

and  applies  to  interval  estimates  with  upper  and  lower  exceedance  probability  ^  x  100%. 
This  approach  can,  again,  be  traced  back  to  Dunsmore  (1968)  and  Winkler  (1972)  and 
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avoids  paradoxes,  as  a  consequence  of  the  propriety  of  the  interval  score.  When  compared 
to  (33),  the  loss  function  (34)  provides  a  more  flexible  assessment  of  the  coverage,  by  taking 
account  of  the  distance  between  the  interval  estimate  and  the  estimand. 


Appendix 

In  this  appendix,  we  compare  Theorem  2.1  to  a  more  direct  extension  of  the  McCarthy 
(1956)  and  Hendrickson  and  Buehler  (1971)  characterization.  The  results  of  McCarthy  and 
Hendrickson  and  Buehler  differ  from  Theorem  2.1  by  considering  functions  on  the  convex 
cone  T)  =  {XP  :  P  G  'P,X  >  0}.  Furthermore,  Hendrickson  and  Buehler  (1971)  assumed 
that  the  convex  class  V  is  dominated  by  a  cr-finite  measure  fj,  on  (H,^).  This  assumption 
can  be  disposed  of  as  follows.  A  function  H  :  H  ^  M  is  homogeneous  if  H{XP)  =  XH{P) 
for  all  P  G  P.  If  there  exists  a  P  G  P  and  a  "P-quasuntegrable  function  P[*(P,  ■)  :  H  ^  M 
such  that 

H{Q)  >H{P)  +  J  H*{P,uj)  dQ{u;)  -  J  H*{P,u)  dP{u;) 

for  all  Q  G  P,  then  H*{P,  ■ )  is  said  to  be  a  subtangent  of  H  relative  to  P  at  the  point  P  G  P. 
Following  Hendrickson  and  Buehler,  it  is  easy  to  show  that  a  scoring  rule  5  :  "P  x  H  ^  M  is 
(strictly)  proper  if  and  only  if 

S{P,uj)  =  H*{P,u)  (35) 

for  P  G  V  and  lv  G  P,  where  H  :  P  ^  M  is  homogeneous,  (strictly)  convex  on  P,  and  such 
that  P[*(P,  • )  is  a  subtangent  of  H  relative  to  P  at  P. 

In  the  case  of  a  finite  sample  space  of  size  m,  the  subtangent  H*{P,  ■ )  in  the  Hendrickson- 
Buehler  characterization  (35)  can  be  identified  with  the  subgradient  of  a  convex  function 
on  The  representation  (4)  suggests  the  characterization  of  Savage  (1971)  who  instead 
considered  convex  functions  on  the  unit  simplex  in  as  detailed  in  Section  3.  That 

said,  the  representations  (4)  and  (35)  are  equivalent  and  closely  related.  If  S  is  of  the  form 
(4)  with  an  associated  convex  function  G  :  P  ^  M,  define  H  :  P  — >  M  by  H(AP)  =  AG(P) 
for  P  gV.  Then  (35)  holds,  since  H*{P,  ■ )  =  G*{P,  • )  is  a  subtangent  of  H  relative  to  P 
at  P  G  P.  Conversely,  suppose  that  S  is  of  the  form  (35)  with  an  associated  homogeneous 
function  H  :  P  ^  M.  Then  S{P,P)  =  H{P)  for  P  G  P  and  the  representation  (4)  holds 
where  G  :  P  ^  M  is  the  restriction  of  P  to  P  and  G*  (P,  • )  =  H*  (P,  • )  is  a  subtangent  of  G 
at  P  G  P. 
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